Abstract

We study the problem of abstracting a table of data about individuals so that no selection query can 
identify fewer than k individuals. As is common in existing work on this k-anonymization problem, 
the means we investigate to perform this anonymization is to generalize values of quasi-identifying 
attributes into equivalence classes. Since such data tables are intended for use in data mining, we consider 
the natural optimization criterion of minimizing the maximum size of any equivalence class, subject
to the constraint that each is of size at least k. We show that it is impossible to achieve arbitrarily 
good polynomial-time approximations for a number of natural variations of the generalization technique, 
unless P = NP, even when the table has only a single quasi-identifying attribute that represents a 
geographic or unordered attribute: 

• Zip-codes: nodes of a planar graph generalized into connected subgraphs

• GPS coordinates: points in the plane generalized into non-overlapping rectangles

• Unordered data: text labels that can be grouped arbitrarily. 

These hard single-attribute instances of generalization problems contrast with the previously known 
NP-hard instances, which require the number of attributes to be proportional to the number of individual 
records (the rows of the table). In addition to impossibility results, we provide approximation algorithms 
for these difficult single-attribute generalization problems, which, of course, apply to multiple-attribute
instances with one that is quasi-identifying. We show theoretically and experimentally that our approx- 
imation algorithms can come reasonably close to optimal solutions. Incidentally, the generalization 
problem for unordered data can be viewed as a novel type of bin packing problem — min-max bin cover- 
ing — which may be of independent interest. 

1 Introduction 

Data mining is an effective means for extracting useful information from various data repositories, to high- 
light, for example, health risks, political trends, consumer spending, or social networking. In addition, some 
public institutions, such as the U.S. Census Bureau, have a mandate to publish data about U.S. communities,
so as to benefit socially-useful data mining. Thus, there is a public interest in having data repositories
available for public study through data mining. 

Unfortunately, fulfilling this public interest is complicated by the fact that many databases contain
confidential or personal information about individuals. The publication of such information is therefore
constrained by laws and policies governing privacy protection. For example, the U.S. Census Bureau must limit
its data releases to those that reveal no information about any individual. Thus, to allow the public to benefit
from the knowledge that can be gained through data mining, a privacy-protecting transformation should be
performed on a database before its publication. 



One of the greatest threats to privacy faced by database publication is a linking attack [25, 24]. In this
type of attack, an adversary who already knows partial identifying information about an individual (e.g., a 
name and address or zip-code) is able to identify a record in another database that belongs to this person. A 
linking attack occurs, then, if an adversary can "link" his prior identifying knowledge about an individual 
through a non-identifying attribute in another database. Non-identifying attributes that can be subject to 
such linking attacks are known as quasi-identifying attributes. 



To combat linking attacks, several researchers [20, 24, 25, 22, 2, 6, 32, 5] have proposed generalization as
a way of specifying a quantifiable privacy requirement for published databases. The generalization approach
is to group attribute values into equivalence classes, and replace each individual attribute value with its class 
name. Of course, we desire a generalization that is best for data mining purposes. Thus, we add an additional 
constraint that we minimize the cost, C, of our chosen generalization, where the cost of a set of equivalence 
classes $\{E_1, E_2, \ldots, E_m\}$ is $C = \sum_{i=1}^{m} c(E_i)$, where $c$ is a cost function defined on equivalence classes
and the summation is defined in terms of either the standard "+" operator or the "max" function. 

The cost function c should represent an optimization goal that is expected to yield a table of transformed 
data that is the best for data mining purposes, e.g., while preserving k-anonymity and/or ℓ-diversity [21].
Thus, we define the cost function c in terms of a generalization error, so, for an equivalence class
$E = \{x_1, x_2, \ldots, x_q\}$,

$$c(E) = \sum_{i=1}^{q} d(x_i, E),$$

where d is a measure of the difference between an element and its equivalence class, and the summations are
defined in terms of either the standard "+" operator or the "max" function. For example, using $d(x_i, E) = 1$,
and taking "+" in the definition of C to be "max", amounts to a desire to minimize the maximum size of
any equivalence class. Alternatively, using $d(x_i, E) = |E| - k$ and standard addition in the definition of
C will quadratically penalize larger equivalence classes. In this paper we focus on generalization methods
that try to minimize the maximum size of any equivalence class, subject to lower bounds on the size of any 
equivalence class. This should also have the side effect of reducing the number of generalizations done, but 
we focus on minimizing the maximum equivalence class here, as it leads to an interesting type of bin-packing 
problem, which may be of interest in its own right. 
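
As an illustration of the cost definitions above, the following minimal Python sketch evaluates C for a given set of equivalence classes. The function name `generalization_cost` and its parameters `inner` and `outer` (which select "+" or "max" for the two summations) are hypothetical names chosen for this sketch and do not come from the paper.

```python
def generalization_cost(classes, d, inner=sum, outer=sum):
    """Cost C of a set of equivalence classes.

    classes: list of equivalence classes, each a list of records.
    d:       function d(x, E) measuring how far x is from its class E.
    inner:   how the d-values inside one class are combined (sum or max).
    outer:   how the per-class costs c(E) are combined (sum or max).
    """
    per_class = [inner([d(x, E) for x in E]) for E in classes]
    return outer(per_class)


k = 3
classes = [["a", "b", "c"], ["d", "e", "f", "g", "h"]]

# d = 1 with an outer "max": the cost is the size of the largest class.
max_size = generalization_cost(classes, lambda x, E: 1, outer=max)

# d = |E| - k with standard addition: larger classes are penalized quadratically.
quadratic = generalization_cost(classes, lambda x, E: len(E) - k)
```

With these two classes of sizes 3 and 5, the first call evaluates to 5 and the second to 0 + 5 * 2 = 10.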

Since most prior work on k-anonymization algorithms has focused on numeric or ordered data, we are
interested in this paper in techniques that can be applied to geographic and unordered data. Such data
commonly occurs in quasi-identifying attributes, but such attributes seem harder to generalize to achieve
k-anonymity. Thus, we are interested in the degree to which one can approximate the optimal way of
generalizing geographic and unordered data, using natural generalization schemes. 



Related Prior Results. The concept of k-anonymity [25, 24], although not a complete solution to linking
attacks, is often an important component of such solutions. In this application of generalization, the
equivalence classes are chosen to ensure that each combination of replacement attributes that occurs in 
the generalized database occurs in at least k of the records. Several researchers have explored heuristics,
extensions, and adaptations for k-anonymization (e.g., see [20, 2, 6, 32, 5, 29]).

As mentioned above, generalization has become a popular way of altering a database (represented as 
a table) so that it satisfies the k-anonymity requirement, by combining attribute values into equivalence
classes. To guide this "combining" process for a particular attribute, a generalization hierarchy (or concept
hierarchy) is often specified, which is either derived from an ordering on the data or itself defines an ordering 
on the data. 

Unfortunately, there is no obvious tree hierarchy for geographic and unordered data. So, for unordered 
data, several researchers have introduced heuristics for deriving hierarchies that can then be used for gener- 
alization. Security properties of the randomization schemes and privacy-preserving data mining in general 
are studied by Kargupta et al. [18], Kantarcioglu et al. [17], and Huang et al. [16]. (See also Fung [13].)
Wang et al. [30] used an iterative bottom-up heuristic to generalize data.

The use of heuristics, rather than exact algorithms, for performing generalization is motivated by claims 
that k-anonymization-based generalization is NP-hard. Meyerson and Williams [22] assume that an input
dataset has been processed into a database or table in which identical records from the original dataset have 
been aggregated into a single row of the table, with a count representing its frequency. They then show 
that if the number of aggregated rows is n and the number of attributes (table columns) is at least 3n, then 
generalization for k-anonymization is NP-hard. Unfortunately, their proof does not show that generalization
is NP-hard in the strong sense: the difficult instances generated by their reduction have frequency counts that 
are large binary numbers, rather than being representable in unary. Therefore, their result doesn't actually 
apply to the original k-anonymization problem. Aggarwal et al. [2] address this deficiency, showing that
k-anonymization is NP-hard in the strong sense for datasets with at least n/3 quasi-identifying attributes.
Their proof uses cell suppression instead of generalization, but Byun et al. [6] show that the proof can be 
extended to generalization. As in the other two NP-hardness proofs, Byun et al. require that the number of 
quasi-identifying attributes be proportional to the number of records, which is typically not the case. Park 
and Shim [23] present an NP-hardness proof for a version of k-anonymization involving cell suppression in
place of generalization, and Wong et al. [31] show an anonymity problem they call (α, k)-anonymity to be
NP-hard. 



Khanna et al. [19] study a problem, RTILE, which is closely related to generalization of geographic data.
RTILE involves tiling an n × n integer grid with at most p rectangles so as to minimize the maximum weight
of any rectangle. They show that no polynomial-time approximation algorithm can achieve an approximation
ratio for RTILE of better than 1.25 unless P=NP. Unlike k-anonymization, however, this problem does
not constrain the minimum weight of a selected rectangle. Aggarwal [1] studies the problem of generalizing
multidimensional data with axis-aligned rectangles using probabilistic clustering techniques, and Hore et
al. [15] study a heuristic based on the use of kd-tree partitioning and a search strategy optimized through the
use of priority queues. Neither of these papers gives provable approximation ratios, however. 



Our Results. In this paper, we study instances of k-anonymization-based generalization in which there is
only a single quasi-identifying attribute, containing geographic or unordered data. In particular, we focus 
on the following attribute types: 

• Zip-codes: nodes of a planar graph generalized into connected subgraphs 

• GPS coordinates: points in the plane generalized into non-overlapping rectangles

• Unordered data: text labels that can be grouped arbitrarily (e.g., disease names). 

We show that even in these simple instances, k-anonymization-based generalization is NP-complete in the
strong sense. Moreover, it is impossible to approximate these problems to within (1 + ε) of optimal, where
ε > 0 is a sufficiently small fixed constant, unless P = NP. These results hold a fortiori for instances with
multiple quasi-identifying attributes of these types, and they greatly strengthen previous NP-hardness results
which require unrealistically large numbers of attributes. Nevertheless, we provide a number of efficient
approximation algorithms and we show, both in terms of their worst-case approximation performance and 
also in terms of their empirical performance on real-world data sets, that they achieve good approximation 
ratios. Our approximation bounds for the zip-codes problem require that the graph has sufficiently strong 
connectivity to guarantee a sufficiently low-degree spanning tree. 

The intent of this paper is not to argue that single-attribute generalization is a typical application of 
privacy protection. Indeed, most real-world anonymization applications will have dozens of attributes whose 
privacy concerns vary from hypersensitive to benign. Moreover, the very notion of k-anonymization has
been shown to be insufficient to protect against all types of linking attack, and has been extended recently
in various ways to address some of those concerns (e.g., see [10, 21, 31]); some work also argues against
any approach similar to k-anonymization [12]. We do not attempt to address this issue here. Rather, our
results should be viewed as showing that even the simplest forms of k-anonymization-based generalization
are difficult but can be approximated. We anticipate that similar results may hold for its generalizations and 
extensions as well. 

In addition, from an algorithmic perspective, our study of k-anonymization-based generalization has
uncovered a new kind of bin-packing problem (e.g., see [9]), which we call Min-Max Bin Covering. In this 
variation, we are given a collection of items and a nominal bin capacity, k, and we wish to distribute the 
items to bins so that each bin has total weight at least k while minimizing the maximum weight of any bin. 
This problem may be of independent interest in the algorithms research community. 

Incidentally, our proof that k-anonymization is NP-hard for points in the plane can be easily adapted to
show that the RTILE problem, studied by Khanna et al. [19], cannot be approximated in polynomial time
by a factor better than 1.33, unless P=NP, which improves the previous non-approximability bound of 1.25. 



2 Zip-code Data 

The first type of quasi-identifying information we consider is that of zip-codes, or analogous numeric codes 
for other geographic regions. Suppose we are given a database consisting of n records, each of which con- 
tains a single quasi-identifying attribute that is itself a zip-code. A common approach in previous papers 
using generalization for zip-code data (e.g., see [6, 32]) is to generalize consecutive zip-codes. That is, these
papers view zip-codes as character strings or integers and generalize based on this data type. Unfortunately,
as is illustrated in Figure 1, when zip-codes are viewed as numbers or strings, geographic adjacency information
can be lost or misleading: consecutive zip codes may be far apart geographically, and geographically
close zip codes may be numerically far, leading to generalizations that have poor quality for data mining 
applications. 

We desire a generalization algorithm for zip-codes that preserves geographic adjacency. Formally, we 
assume each zip-code is the name of a node in a planar graph, G. The most natural generalization in this case 
is to group nodes of G into equivalence classes that are connected subgraphs. This is motivated, in the zip- 
code case, by a desire to group adjacent regions in a country, which would naturally have more likelihood 
to be correlated according to factors desired as outcomes from data mining, such as health or buying trends. 
So the optimization problem we investigate in this section is one in which we are given a planar graph, G, 
with non-negative integer weights on its nodes (representing the number of records for each node), and we 
wish to partition G into connected subgraphs so that the maximum weight of any subgraph is minimized 
subject to the constraint that each has weight at least k. 
Figure 1: The US ZIPScribble Map, which connects consecutive zipcodes. Note the lack
of proximity preservation in the West and the artificial separations in the East. (From 
http://eagereyes.org/Applications/ZIPScribleMap.html.) 

Generalization for Zip-codes is Hard. Converting this to a decision problem, we can add a parameter K 
and ask if there exists a partition into connected subgraphs such that the weight of each subgraph in G is at 
least k and at most K. In this section, we show that this problem is NP-complete even if the weights are all
equal to 1 and k = 3. Our proof is based on a simple reduction that sets K = 3, so as to provide a reduction
from the following problem: 

3-Regular Planar Partition into Paths of Length 2 (3PPPL2): Given a 3-regular planar graph G, can G 
be partitioned into paths of length 2? That is, is there a spanning forest for G such that each connected 
component is a path of length 2? 

This problem is a special case of the problem, "Partition into Paths of Length-2 (PPL2)", whose
NP-completeness is included as an exercise in Garey and Johnson [14]. Like PPL2, 3PPPL2 is easily shown to
be in NP. To show that 3PPPL2 is NP-hard, we provide a reduction from the 3-dimensional matching (3DM) 
problem: 

3-Dimensional Matching (3DM): Given three sets X, Y, and Z, each of size n, and a set of triples 
$\{(x_1, y_1, z_1), \ldots, (x_m, y_m, z_m)\}$, is there a subset S of n triples such that each element in X, Y, and Z
is contained in exactly one of the triples? 

Suppose we are given an instance of 3DM. We create a vertex for each element in X, Y, and Z. For each 
tuple $(x_i, y_i, z_i)$, we create a tuple subgraph gadget as shown in Figure 2a, with nodes $t_{i,x}$, $t_{i,y}$, and $t_{i,z}$,
which correspond to the representatives $x_i$, $y_i$, and $z_i$, respectively, in the tuple. We then connect each $t_{i,x}$,
$t_{i,y}$, and $t_{i,z}$ vertex to the corresponding element vertex from X, Y, and Z, respectively, using the connector
gadget in Figure 2b.

This construction is, in fact, a version of the well-known folklore reduction from 3DM to PPL2, which 
solves an exercise in Garey and Johnson [14]. Note, for example, that the vertices in the triangle in the tuple
gadget must all three be completely included in a single group or must all be in separate groups. If they are
all included, then grouping the degree-1 vertices requires that the corresponding x, y, and z elements must
all be included in a group with the degree-1 vertex on the connector. If they are all not included, then the
corresponding x, y, and z elements must be excluded from a group in this set of gadgets. 

Continuing the reduction to an instance of 3PPPL2, we make a series of transformations. The first is 
to embed the graph in the plane in such a way that the only crossings occur in connector gadgets. We then 
take each crossing of a connector, as shown in Figure 3a, and replace it with the cross-over gadget shown in
Figure 3b.

There are four symmetric ways this gadget can be partitioned into paths of length 2, two of which are 
shown in Figures 5a and 5b. Note that the four ways correspond to the four possible ways that connector
"parity" can be transmitted and that they correctly perform a cross-over of these two parities. In particular, 
note that it is impossible for opposite connectors to have the same parity in any partition into paths of length 
2. Thus, replacing each crossing with a cross-over gadget completes a reduction of 3DM to planar partition
into paths of length 2.

Next, note that all vertices of the planar graph are degree-3 or less except for the "choice" vertices at the 
center of cross-over gadgets and possibly some nodes corresponding to elements of X, Y, and Z. For each 
of these, we note that all the edges incident on such nodes are connectors. We therefore replace each vertex 
of degree-4 or higher with three connector gadgets that connect the original vertex to three binary trees 
whose respective edges are all connector gadgets. This allows us to "fan out" the choice semantics of the 
original vertex while exclusively using degree-3 vertices. To complete the reduction, we perform additional 
simple transformations to the planar graph to make it 3-regular. In particular, we add to each degree-1 vertex
the "cap" gadget shown in Figure 4a. Likewise, we add to each degree-2 vertex the cap shown in Figure 4b.
Note that in both cases, these subgraphs must be partitioned into paths of length 2 that do not extend outside 
the subgraph. Thus, adding these subgraphs to the original graph does not alter a partition into paths of 
length 2 for the original graph. This completes the reduction of 3DM to 3PPPL2. 




Figure 2: Gadgets for reducing 3DM to PPL2. (a) the tuple gadget; (b) the connector. 




Figure 3: Dealing with edge crossings. (a) a connector crossing; (b) the cross-over gadget.

Figure 4: Augmentations to achieve 3-regularity. (a) a "cap" to add to a degree-1 vertex; (b) a "cap" to add
to a degree-2 vertex.

An Approximation Algorithm for Zip-codes. In this section we provide an approximation algorithm for 
k-anonymization of zip-codes. Suppose, therefore, that we are given a connected planar graph G with
non-negative integer vertex weights, and we wish to partition G into connected subgraphs of weight at least k,
while minimizing the maximum weight of any subgraph. 

We start by forming a low-degree spanning tree T of G; let d be the degree of T. We note that
3-connected planar graphs are guaranteed to have a spanning tree of degree three [4], giving d = 3, while
Tutte [27] proved that 4-connected planar graphs are always Hamiltonian, giving d = 2; see [7, 8, 26] for
algorithms to construct T efficiently in these cases. We then find an edge e such that removing e from T
leaves two trees $T_1$ and $T_2$, both of weight at least k, with the weight of $T_1$ as small as possible. If such an
edge exists, we form one connected subgraph from $T_1$ and continue to partition the remaining low-degree
tree $T_2$ in the same fashion; otherwise, we form a single connected subgraph from all of T.
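
The splitting procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the low-degree spanning tree T has already been computed and is given as an adjacency dictionary, it searches for the best edge by brute force (so it is polynomial but not tuned for speed), and the names `split_tree` and `_side` are hypothetical.

```python
def _side(adj, start, blocked):
    """Nodes reachable from `start` once the tree edge (start, blocked) is removed."""
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in seen or (x == start and y == blocked):
                continue
            seen.add(y)
            stack.append(y)
    return seen


def split_tree(adj, weight, k):
    """Partition a weighted tree into connected groups, each of weight >= k.

    adj:    dict mapping each node to the set of its neighbours in the tree T.
    weight: dict mapping each node to its non-negative integer weight.
    Returns a list of node sets (the equivalence classes).
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    groups = []
    while True:
        total = sum(weight[v] for v in adj)
        best = None  # (weight of T1, nodes of T1)
        for u in adj:
            for v in adj[u]:
                side = _side(adj, u, v)  # candidate T1: the part containing u
                w1 = sum(weight[x] for x in side)
                if w1 >= k and total - w1 >= k and (best is None or w1 < best[0]):
                    best = (w1, side)
        if best is None:
            groups.append(set(adj))  # no valid edge: keep the rest as one group
            return groups
        side = best[1]
        groups.append(side)
        for x in side:  # remove T1 and continue with T2
            del adj[x]
        for ns in adj.values():
            ns -= side
```

For example, on a path of six unit-weight nodes with k = 2, the sketch produces three groups of weight 2; on a star whose center has weight 1 and whose three leaves have weight 2, with k = 3, no edge qualifies and the whole tree becomes a single group of weight 7.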

Let $K = \max(k, x_1, x_2, \ldots)$, where the $x_i$ are the individual item sizes; clearly, the optimal cost of any
solution is at least K.

Lemma 1 If the algorithm outlined above cannot find any edge e to split T, then the cost of T is at most
$K + d(k - 1)$.

Proof: Orient each edge e of T from the smaller weight subtree formed by splitting at e to the larger weight
subtree; if a tie occurs break it arbitrarily. Then T must have a unique vertex v at which all edges are oriented
inwards. The weight of v is at most K, and it is adjacent to at most d subtrees each of which has weight
at most $k - 1$ (or else the edge connecting to that subtree would have been oriented outwards) so the total
weight of the tree is at most $K + d(k - 1)$, as claimed. ■

Lemma 2 If the algorithm above splits tree T into two subtrees $T_1$ and $T_2$, then the cost of $T_1$ is at most
$K + (d - 1)(k - 1)$.

Proof: Let v be the node in $T_1$ adjacent to the cut edge e. Then the weight of v is at most K, and v is
adjacent to at most $d - 1$ subtrees of $T_1$ (because it is also adjacent to e). Each of these subtrees has weight
at most $k - 1$, or else we would have cut one of the edges connecting to them in preference to the chosen
edge e. Therefore, the total weight is at most $K + (d - 1)(k - 1)$, as claimed. ■

Figure 5: Dealing with crossings. (a) one possible partition of the cross-over gadget; (b) another one of the
four possible partitions.

Theorem 3 There is a polynomial-time approximation algorithm for k-anonymization on planar graphs 
that guarantees an approximation ratio of 4 for 3-connected planar graphs and 3 for 4-connected planar 
graphs. It is not possible for a polynomial-time algorithm to achieve an approximation ratio better than 
1.33, even for 3-regular planar graphs, unless P=NP.

Proof: For 3-connected planar graphs, d = 3, and the lemmas above show that our algorithm produces a
solution with quality at most $K + 3(k - 1) < 4K \le 4\,\mathrm{OPT}$. Similarly, for 4-connected planar graphs, d = 2
and the lemmas above show that our algorithm produces a solution with quality at most
$K + 2(k - 1) < 3K \le 3\,\mathrm{OPT}$.

The inapproximability result follows from the NP-completeness result in the main text of the paper, as 
the graph resulting from that reduction either has a partition into 3-vertex connected subgraphs or some 
subgraph requires four or more vertices. ■



3 GPS-Coordinate Data 

Next we treat geographic data that is given as geographic coordinates rather than having already been gen- 
eralized to zip-codes. Suppose we are given a table consisting of n records, each of which contains a single 
quasi-identifying attribute that is itself a GPS coordinate, that is, a point (x, y) in the plane. For example, 
the quasi-identifying attribute could be the GPS coordinate of a home or elementary school. Suppose further 
that we wish to generalize such sets of points using axis-aligned rectangles. 
Generalizing GPS-Coordinates is Hard. Converting this to a decision problem, we can add a parameter
K and ask whether there exists a partition of the plane into rectangles such that the weight of the input points 
within each rectangle is at least k and at most K. In this section, we show that this problem is NP-complete 
even when we set k and K equal to 3. Our proof is based on a simple reduction from 3-dimensional matching 
(3DM). 

Suppose we are given an instance of 3DM. We first reduce this instance to a rectangular k-anonymization
problem inside a rectilinear polygon with holes; we show later how to replace the edges of the polygon by
points. We begin by creating a separate point for each element in X, Y, and Z. For each tuple $(x_i, y_i, z_i)$,
we create a tuple gadget as shown in Figure 6a; the three points in the interior must all be contained in
a single rectangle or each of them must be in a separate rectangle joining the two points that sit in the
"doorway" of a "corridor". Inside the corridor, we alternately place singleton points and pairs of points,
placing each singleton or pair at a corner of the corridor, so that the only way to cover three points within the
corridor is to use both points of a pair and one nearby singleton point; thus, any covering of all of the points
of the corridor by rectangles containing exactly three points must preserve the parity of the connections at
the adjacent doorways. For each tuple $(x_i, y_i, z_i)$, we route the corridors from our tuple gadget to each of
the points $x_i$, $y_i$, and $z_i$, so that the corridors for any point, such as $x_i$, meet in a chooser gadget as shown
in Figure 6b. Note: if the degree of a point grows to more than three, we can fan in the corridors in binary
trees whose internal nodes are represented with chooser gadgets. Of course, some corridors may cross each
other in this drawing, in which case we replace each corridor crossing with the crossing gadget shown in
Figure 6c.



Figure 6: Gadgets for reducing 3DM to rectangular k-anonymization in the plane. (a) the tuple gadget; (b)
the chooser gadget; (c) the cross-over gadget.

When we have completed this construction, we will have reduced 3DM to a rectangle k-anonymization
problem inside a rectilinear polygon P containing holes that has its points and polygon vertices on a 
polynomial-sized integer grid. 

To complete the construction, then, we place 3 (identical) points at every grid location that is not properly 
in the interior of P. Each such set of three points must be partitioned into a separate rectangle, which will 
block any rectangle containing points properly inside P from crossing the boundary of P without increasing
its weight beyond k. Thus, we can "erase" P at this point and we will have reduced 3DM to an instance
of rectangular k-anonymization in the plane, for k = 3.

An Approximation Algorithm for GPS Coordinates. In this subsection, we provide an approximation 
algorithm for GPS coordinates. Suppose, therefore, that we are given a set of points S and we wish to 
partition the plane into axis-aligned rectangles so as to minimize the maximum weight of any rectangle. We
construct a kd-tree on S, using the cutting rule of always splitting a rectangle with an axis-aligned cutting
line if it is possible to create two subrectangles each of weight at least k. When no rectangle can be cut
in a way that satisfies this criterion, we stop. We note that this will produce a good approximation to the
optimal solution, with a worst-case degenerate case being four points of multiplicity $k - 1$ placed at the
N, S, E, and W directions of a point of multiplicity k. It may be possible for such a configuration to be
partitioned into rectangles of size k, whereas our approximation may, in the worst case, produce a rectangle
of weight $5k - 4$ in this case. Therefore, we have the following:

Theorem 4 There is a polynomial-time approximation algorithm for rectangular generalization, with re- 
spect to k-anonymization in the plane, that achieves an approximation ratio of 5 in the worst case. It is not 
possible for a polynomial-time algorithm to achieve an approximation ratio better than 1.33 unless P=NP. 
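
The kd-tree cutting rule behind Theorem 4 can be sketched in Python as below. The text only requires that a cut leave weight at least k on both sides; choosing the most balanced valid cut is an assumption made for this sketch, as are the function name `kd_partition` and the `(x, y, weight)` point representation. Each returned group stands for the points of one rectangle.

```python
def kd_partition(points, k):
    """Recursively split weighted points with axis-aligned cuts.

    points: list of (x, y, weight) tuples, weight being the point's multiplicity.
    A cut is allowed only if both sides keep total weight >= k.
    Returns a list of groups (lists of points), one per final rectangle.
    """
    total = sum(w for _, _, w in points)
    best = None  # (imbalance, left points, right points)
    for axis in (0, 1):
        pts = sorted(points, key=lambda p: p[axis])
        left = 0
        for i in range(len(pts) - 1):
            left += pts[i][2]
            if pts[i][axis] == pts[i + 1][axis]:
                continue  # a cutting line cannot separate equal coordinates
            if left >= k and total - left >= k:
                imbalance = abs(total - 2 * left)
                if best is None or imbalance < best[0]:
                    best = (imbalance, pts[: i + 1], pts[i + 1:])
    if best is None:
        return [points]  # no valid cut: this rectangle is final
    return kd_partition(best[1], k) + kd_partition(best[2], k)


# The degenerate configuration from the text, with k = 3: a center point of
# multiplicity k and four neighbours of multiplicity k - 1.
k = 3
cross = [(0, 0, k), (0, 1, k - 1), (0, -1, k - 1), (1, 0, k - 1), (-1, 0, k - 1)]
groups = kd_partition(cross, k)
```

On the cross-shaped input no axis-aligned line leaves weight at least k on both sides, so the sketch returns a single group of weight 5k - 4 = 11, matching the degenerate case described above.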

We note that a similar reduction to the one we give above can be used to show that no polynomial-time 
algorithm can achieve an approximation ratio better than 1.33 for the RTILE problem, unless P=NP, which 
improves the previous lower bound for this problem of 1.25 [19].

4 The Min-Max Bin Covering Problem 

In this section, we examine single-attribute generalization, with respect to the problem of k-anonymization
for unordered data, where quasi-identifying attribute values are arbitrary labels that come from an unordered
universe. (Note that if the labels were instead drawn from an ordered universe, and we required the
generalization groups to be intervals, the resulting one-dimensional k-anonymization problem could be solved
optimally in polynomial time by a simple dynamic programming algorithm.) Our optimization problem,
then, is to generalize the input labels into equivalence classes so as to minimize the maximum size of any
equivalence class, subject to the k-anonymization constraint.
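
The parenthetical remark about ordered data can be made concrete with a short sketch. The paper does not spell out the dynamic program, so the quadratic-time formulation below is only one natural possibility, and the function name `min_max_intervals` is hypothetical. Here `sizes[i]` is the number of records carrying the i-th label in sorted order.

```python
import math


def min_max_intervals(sizes, k):
    """Minimum possible maximum interval weight when the ordered labels are
    cut into consecutive intervals, each of total weight >= k.

    Returns math.inf if no feasible partition exists.
    """
    n = len(sizes)
    prefix = [0]
    for s in sizes:
        prefix.append(prefix[-1] + s)
    # best[i] = optimal cost for the first i labels
    best = [math.inf] * (n + 1)
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(i):
            w = prefix[i] - prefix[j]  # weight of the interval j+1..i
            if w >= k and best[j] < math.inf:
                best[i] = min(best[i], max(best[j], w))
    return best[n]
```

For instance, with sizes [3, 1, 3] and k = 3, the best choice is to split into intervals of weight 3 and 4 (in either order), giving cost 4.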

It is convenient in this context to use the terminology of bin packing; henceforth in this section we refer 
to the input labels as items, the equivalence classes as bins, and the entire generalization as a packing. The 
size of an item corresponds in this way to the number of records having a given label as their attribute value. 
Thus the problem becomes the following, which we call the Min-Max Bin Covering Problem: 

Input: Positive integers $x_1, x_2, \ldots, x_n$ and an integer nominal bin capacity $k > 0$.

Output: a partition of $\{1, 2, \ldots, n\}$ into subsets $S_j$, satisfying the constraint that, for each $j$,

$$\sum_{i \in S_j} x_i \ge k, \qquad (1)$$

and minimizing the objective function

$$\max_j \sum_{i \in S_j} x_i. \qquad (2)$$

We will say that a partition satisfying (1) for all j is feasible, and the function shown in (2) is the cost of this
partition. Note that any feasible solution has cost at least k.
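
To make the definitions of feasibility and cost concrete, here is a small Python sketch. The names `packing_cost` and `brute_force_opt` are hypothetical; the exhaustive search is exponential in the number of items and is meant only for checking tiny instances, not as a practical algorithm.

```python
def packing_cost(sizes, bins, k):
    """Cost of a packing, or None if some bin violates the lower bound k.

    sizes: list of item sizes x_1..x_n (0-indexed here).
    bins:  list of lists of item indices, forming a partition of the items.
    """
    levels = [sum(sizes[i] for i in b) for b in bins]
    if any(level < k for level in levels):
        return None
    return max(levels)


def brute_force_opt(sizes, k):
    """Optimal cost by trying every partition (tiny inputs only)."""
    n = len(sizes)
    best = None

    def rec(i, levels):
        nonlocal best
        if i == n:
            if levels and all(level >= k for level in levels):
                cost = max(levels)
                if best is None or cost < best:
                    best = cost
            return
        for b in range(len(levels)):  # put item i into an existing bin
            levels[b] += sizes[i]
            rec(i + 1, levels)
            levels[b] -= sizes[i]
        levels.append(sizes[i])  # or open a new bin for item i
        rec(i + 1, levels)
        levels.pop()

    rec(0, [])
    return best
```

For sizes [1, 2, 3] and k = 3, the packing {1, 2}, {3} is feasible with cost 3, whereas for sizes [2, 2, 2] and k = 3 the only feasible packing puts all items in one bin, giving cost 6.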
Hardness Results. In this subsection, we show that Min-Max Bin Covering is NP-hard in the strong sense.
We begin by converting the problem to a decision problem by adding a parameter K, which is intended as
an upper bound on the size of any bin: rather than minimizing the maximum size of a bin, we ask whether
there exists a solution in which all bins have size at most K. This problem is clearly in NP. 

We show that Min-Max Bin Covering is NP-hard by a reduction from the following problem, which is 
NP-complete in the strong sense [14].

• 3-Partition. Given a value B, and a set S of 3m weights $w_1, w_2, \ldots, w_{3m}$, each lying in $(B/4, B/2)$,
such that $\sum_{i=1}^{3m} w_i = mB$, can we partition $\{1, 2, \ldots, 3m\}$ into sets $S_j$ such that for each $j$,
$\sum_{i \in S_j} w_i = B$? (Note that any such family of sets $S_j$ would have to have exactly m members.)

For the reduction we simply let $x_i = w_i$ and k = K = B. If the 3-Partition problem has answer yes,
then we can partition the items into m sets each of total size k = K = B so the Min-Max Bin Covering 
problem has answer yes. If, on the other hand, the 3-Partition problem has answer no, no such partition is 
possible, so we have 

Theorem 5 Min-Max Bin Covering is NP-complete in the strong sense. 

In the preprint version of this paper fTPl, we show that there are limits on how well we can approximate 
the optimum solution (unless P = NP): 

Theorem 6 Assuming P ≠ NP, there does not exist a polynomial-time algorithm for Min-Max Bin Covering
that guarantees an approximation ratio better than 2 (when inputs are expressed in binary), or better than
4/3 (when inputs are expressed in unary).

Achievable Approximation Ratios. While the previous subsection shows that sufficiently small approximation
ratios are hard to achieve, in this subsection we show that we can establish larger approximation bounds
with polynomial-time algorithms. The algorithms in this subsection can handle inputs that are expressed either
in unary or binary, so they are governed by the stronger lower bound of 2 on the approximation ratio given
in Theorem 6. If A is some algorithm for the Min-Max Bin Covering Problem, and I is some instance, let A(I)
denote the cost of the solution obtained by A. Let Opt(I) denote the optimum cost for this instance.

Note that if $\sum_{i=1}^{n} x_i < k$, there is no feasible solution; we will therefore restrict our attention to instances
for which

$$\sum_{i=1}^{n} x_i \ge k. \qquad (3)$$

An approximation ratio of three is fairly easy to achieve.

Theorem 7 Assuming (3), there is a linear-time algorithm A guaranteeing that

$$A(I) \le \max\left(k - 1 + \max_{i=1}^{n} x_i,\; 3k - 3\right).$$

Proof: Put all items of size k or greater into their own bins, and then, with new bins, use the Next Fit 
heuristic for bin covering (see |T|) for the remaining items, i.e., add the items one at a time, moving to a new 
bin once the current bin is filled to a level of at least k. Then all but the last bin in this packing have level 
at most 2k - 2, as they each have level at most k - 1 before the last item is added and this last item
has size less than k. There may be one leftover bin with level less than k which must be merged with some 
other bin, leading to the claimed bound. ■ 
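
The procedure in this proof can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in our experiments: the name `fold` is borrowed from Section 5, and merging the leftover bin into the bin with the lowest level is our own assumption, since the proof only requires merging it with some other bin.

```python
def fold(sizes, k):
    """Next Fit style packing: oversize items get their own bins, the rest
    are packed in order, closing a bin once its level reaches k."""
    if sum(sizes) < k:
        raise ValueError("no feasible solution: total size is less than k")
    bins = [[x] for x in sizes if x >= k]
    current, level = [], 0
    for x in sizes:
        if x >= k:
            continue
        current.append(x)
        level += x
        if level >= k:
            bins.append(current)
            current, level = [], 0
    if current:
        # Leftover bin with level below k: merge it into another bin
        # (assumption: we pick the bin with the lowest level).
        min(bins, key=sum).extend(current)
    return bins


def fold_bound(sizes, k):
    """The bound max(k - 1 + max_i x_i, 3k - 3) from Theorem 7."""
    return max(k - 1 + max(sizes), 3 * k - 3)


sizes, k = [3, 3, 3, 3, 2], 5
packing = fold(sizes, k)
print(packing, max(sum(b) for b in packing), fold_bound(sizes, k))
```

Each item is examined a constant number of times apart from the final search for a bin to merge into, so the running time is linear in the number of items.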

With a bit more effort we can improve the approximation ratio. For convenience, in the remainder of
this section we scale the problem by dividing the item sizes by k. Thus each bin must have level at least 1,
and the item sizes are multiples of 1/k.

Lemma 8 Suppose we are given a list of numbers $x_1, x_2, \ldots, x_n$, with each $x_i \le 1/2$ and $\sum_{i=1}^{n} x_i \le 1$.
Then we can partition the list into three parts each having a sum of at most 1/2.

Proof: Omitted. 
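
One way to see why such a partition exists is a simple sequential fill, sketched below. This is our own illustration and not necessarily the omitted argument; it assumes each item has size at most 1/2 and the total is at most 1. The first two parts never exceed 1/2 by construction. If the second part is nonempty, its first item did not fit into the first part, so the first two parts together exceed 1/2 and the third part holds less than 1/2.

```python
from fractions import Fraction


def split_three(items):
    """Split items (each at most 1/2, total at most 1) into three parts,
    each with sum at most 1/2, by filling the parts one after another."""
    half = Fraction(1, 2)
    parts = [[], [], []]
    p = 0               # index of the part currently being filled
    level = Fraction(0)
    for x in items:
        # Move on to the next part when x would overflow parts 1 or 2.
        if p < 2 and level + x > half:
            p += 1
            level = Fraction(0)
        parts[p].append(x)
        level += x
    return parts


items = [Fraction(3, 10), Fraction(3, 10), Fraction(3, 10), Fraction(1, 10)]
for part in split_three(items):
    print(part, sum(part))
```

Exact rational arithmetic is used so that comparisons against 1/2 are not affected by floating-point rounding.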

Theorem 9 There is a polynomial-time algorithm to solve Min-Max Bin Covering with an approximation factor
of 5/2.

Proof: We will assume without loss of generality that Opt(I) ≤ 6/5, since otherwise the algorithm of
Theorem 7 could give a 5/2-approximation.

Assume the items are numbered in order of decreasing size. Pack them greedily in this order into
successive bins, moving to a new bin when the current bin has level at least 1. Note that then all of the bins
will have levels less than 2, and all of the bins except the last will have level at least 1. If the last bin also has
level at least 1, this packing is feasible and has cost less than 2, so it is within a factor of 2 of the optimum.

Next suppose that the last bin has level less than 1. We omit the details for the case in which we have
formed at most 3 bins, and subsequently we assume we have formed at least 4 bins.

Now let $f$ be the size of the largest item in the final bin, and let $r$ be the total size of the other items in the
last bin. Call an item oversize if its size is at least 1, large if its size is in (1/2, 1), and small if its size is at
most 1/2. Consider two cases.

Case 1. $f \le 1/2$. Then all items in the last bin are small, so by Lemma 8 we can partition them into
three sets, each of total size at most 1/2. Add each of these sets to one of the first three bins, so no bin is
filled to more than 5/2, unless it was one of the bins containing an oversize item. (We no longer use the last
bin.) Thus we have achieved an approximation ratio of 5/2.

Case 2. $f > 1/2$. Note that in this case there must be an odd number of large items, since each bin
except the last has either zero or exactly two large items. Note also that $r$ in this case is the total size of
the small items, and $r < 1/2$. Let $x_i$ be the first large item packed. If $x_i$ lies in the last bin, we must have
packed at least one oversize item. Then moving all of the items from the last bin (which will no longer be
used) into the bin with this oversize item guarantees a 2-approximation. Thus assume $x_i$ is not in the last
bin.

Case 2.1. $x_i + r \ge 1$. Then swap items $x_i$ and $f$, so the last bin will be filled to a level $x_i + r \in [1, 2]$.
Also, the bin now containing $f$ will contain two items of size in the range [1/2, 1] and thus have a level in
the range [1, 2]. Thus we have a solution that meets the constraints and has cost at most 2.

Case 2.2. $x_i + r < 1$. Since $r$ is the total size of the small items, if any bin had only one large item it
could not have level at least 1 (as required for feasibility) and at most 6/5 (as required since Opt(I) ≤ 6/5).
Thus the optimum solution has no bin containing only one large item. Since there are an odd number of
large items, this means that the optimum solution has at least one bin with 3 or more large items, so the cost
of the optimum solution is at least 3/2. But then since the simple algorithm of Theorem 7 gives a solution
of cost less than 3, it provides a solution that is at most twice the optimum. ■

A Polynomial Time Approximation Guaranteeing a Ratio Approaching 2. With more effort we can
come arbitrarily close to the lower bound of 2 on the approximation factor given in Theorem 6 for the binary
case, with a polynomial-time algorithm.

Theorem 10 For each fixed $\epsilon > 0$, there is a polynomial-time algorithm $A_\epsilon$ that, given some instance I of
Min-Max Bin Covering, finds a solution satisfying

$$A_\epsilon(I) \le (1 + \epsilon)(\mathrm{Opt}(I) + 1). \qquad (4)$$

(The degree of the polynomial becomes quite large as $\epsilon$ becomes small.)

Proof: The idea of the proof is similar to many approximation algorithms for bin packing (see in particular
[28, Chapter 9]); for the current problem, we have to be especially careful to ensure that the solution
constructed is feasible.

We can assume that the optimum cost is at most 3, by the following reasoning. Say an item is nominal
if its size is less than 1, and oversize if its size is greater than or equal to 1. First suppose the total size
of the nominal items is at least 1 and some oversize item has size at least 3. Then the greedy algorithm of
Theorem 7 achieves an optimum solution, so we are done. Next suppose the sum of the nominal items is at
least 1 and no oversize item has size at least 3. Then the greedy algorithm of Theorem 7 achieves a solution
of cost at most 3, so the optimum is at most 3. Finally suppose that the total size of the nominal items is less
than 1. Then there must be an optimum solution in which every bin contains exactly one oversize item (and
possibly some nominal items). Let $t_0$ (resp. $t_1$) be the size of the smallest (resp. largest) oversize item. If
$t_1 - t_0 \ge 1$, then we can form an optimum solution by putting all nominal items in a bin with $t_0$. If on the
other hand $t_1 - t_0 < 1$, we can reduce the size of all oversize items by $t_0 - 1$ without changing the structure
of the problem, after which all oversize items will have size at most 2, and the optimum will be at most 3.

Now call those items that have size greater than or equal to $\epsilon$ large, and the others small. Let
$b = \lfloor \sum_{i=1}^{n} x_i \rfloor$; note that any feasible partition will have at most $b$ bins. Let $N$ be the largest integer
for which $\epsilon(1 + \epsilon)^N$ is less than three; note that $N$ is a constant depending only on $\epsilon$. Let

$$S = \{\epsilon(1 + \epsilon)^i : i \in \{0, 1, 2, \ldots, N\}\}.$$

For any item size x, define round(x) to be the largest value in S that is less than or equal to x. Let the type
of a packing P, written type(P), be the result of discarding all small items in P, and replacing each large
$x_i$ by round($x_i$). Note that any type can be viewed as a partial packing in which the bins contain only items
with sizes in S.
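
The rounding step can be sketched as follows. This is an illustrative sketch only: the helper names `rounding_grid` and `round_down` are our own, and exact rational arithmetic is assumed so that the grid values are not perturbed by floating-point error.

```python
from bisect import bisect_right
from fractions import Fraction


def rounding_grid(eps):
    """Return the sorted list of values eps * (1 + eps)**i that are below 3."""
    grid = []
    value = eps
    while value < 3:
        grid.append(value)
        value *= 1 + eps
    return grid


def round_down(x, grid):
    """Return the largest grid value that is less than or equal to x."""
    if x < grid[0]:
        raise ValueError("only large items (size >= eps) are rounded")
    return grid[bisect_right(grid, x) - 1]


eps = Fraction(1, 2)
S = rounding_grid(eps)
print(S)                            # grid values for eps = 1/2
print(round_down(Fraction(1), S))   # rounded size of an item of size 1
print(round_down(Fraction(2), S))   # rounded size of an item of size 2
```

Because consecutive grid values differ by a factor of $1 + \epsilon$, the number of distinct rounded sizes depends only on $\epsilon$, which is what keeps the number of bin configurations constant.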

Since, for fixed $\epsilon$, there are only a constant number of item sizes in S, and each of these is at least $\epsilon$,
there are only finitely many ways of packing a bin to a level of at most 3 using the rounded values; call each
of these ways a configuration of a bin. Since the ordering of the bins does not matter, we can represent the
type of a packing by the number of times it uses each configuration. It is not hard to show that for fixed $\epsilon$, as
in the proof of [28, Lemma 9.4], there are only polynomially many types having at most b bins. (Of course,
for small $\epsilon$, this will be a polynomial of very high degree.) We will allow types that leave some of the bins
empty, allowing them to be filled later.

The algorithm proceeds as follows. Enumerate all possible types T that can be formed using the rounded 
large item sizes. For each such type T carry out the following steps: 

1. Let T' be the result of replacing each item x in T, which resulted from rounding some original input
item $x_i$, by any one of the original items $x_j$ such that x = round($x_j$), in such a way that the set of
items in T' is the same as the set of large items in the original input. Note that there is no guarantee
that $x_i = x_j$, since the rounding process does not maintain the distinct identities of different items
that round to the same value in S. However, we do know that round($x_i$) = round($x_j$), so we can
conclude that $x_j / x_i \in ((1 + \epsilon)^{-1}, 1 + \epsilon)$.

2. Pack the small items into T' by processing them in an arbitrary order, placing each into the bin with
the lowest current level. Call this the greedy completion of T.

3. Finally, while any bin has a level less than 1, merge two bins with the lowest current levels. Note that
this will lead to a feasible packing because of (3). Call the resulting packing $\mathcal{F}(T)$, and let cost($\mathcal{F}(T)$)
be the maximum level to which any bin is filled.

Return the packing $\mathcal{F}(T)$ that minimizes cost($\mathcal{F}(T)$) over all T.
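
Steps 2 and 3 can be sketched as below for a single candidate packing T'. This is a minimal sketch under our own assumptions: bins are plain lists of scaled item sizes, the function names are hypothetical, and the enumeration of types and step 1 are not shown.

```python
import heapq
from fractions import Fraction


def greedy_completion(bins, small_items):
    """Step 2: place each small item into the bin with the lowest level."""
    bins = [list(b) for b in bins]
    heap = [(sum(b), idx) for idx, b in enumerate(bins)]
    heapq.heapify(heap)
    for x in small_items:
        level, idx = heapq.heappop(heap)
        bins[idx].append(x)
        heapq.heappush(heap, (level + x, idx))
    return bins


def merge_underfull(bins):
    """Step 3: while some bin has level below 1, merge the two lowest bins."""
    bins = [list(b) for b in bins]
    while len(bins) > 1 and min(sum(b) for b in bins) < 1:
        bins.sort(key=sum)
        bins = [bins[0] + bins[1]] + bins[2:]
    return bins


# Hypothetical partial packing T' with one empty bin, plus two small items.
t_prime = [[Fraction(5, 4)], [Fraction(1, 2)], []]
small = [Fraction(1, 4), Fraction(1, 4)]

completed = greedy_completion(t_prime, small)
final = merge_underfull(completed)
print([sum(b) for b in completed])
print([sum(b) for b in final], max(sum(b) for b in final))
```

The heap keeps the choice of the lowest bin cheap in step 2. Step 3 re-sorts on every merge for simplicity, which is adequate for a sketch but not tuned for speed.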

We now show that (4) holds. Let $P^*$ be a feasible packing achieving Opt(I), and let $P^*_{\mathrm{large}}$ be the
result of discarding the small items in $P^*$ (retaining any bins that become empty). Consider the type T
obtained by rounding all large items in $P^*_{\mathrm{large}}$ down to a size in S. Note that this must be one of the types,
say T, considered in the algorithm. When we perform step 1 on T, we obtain a packing T' such that
cost(T') ≤ $(1 + \epsilon)$ cost($P^*$).

If any level in the greedy completion is greater than $(1 + \epsilon)\mathrm{Opt}(I) + \epsilon$, then during the greedy completion
all bins must have reached a level greater than $(1 + \epsilon)\mathrm{Opt}(I)$, so their total size would be greater than
$(1 + \epsilon)\sum_{i=1}^{n} x_i$, contradicting the fact that the greedy completion uses each of the original items exactly
once. Thus all bins in the greedy completion have level at most $(1 + \epsilon)\mathrm{Opt}(I) + \epsilon$. Also, it cannot be that
all bins in the greedy completion have level less than 1, since then the total size of the items would be less
than the number of bins, contradicting the fact that the optimum solution covers all the bins.

During step 3, as long as at least two bins have levels below 1, two of them will be merged to form
a bin with a level at most 2. If then only one bin remains with a level below 1, it will be merged with a
bin with level in $[1, (1 + \epsilon)\mathrm{Opt}(I) + \epsilon]$ to form a feasible packing with no bin filled to a level beyond
$(1 + \epsilon)\mathrm{Opt}(I) + 1 + \epsilon$, as desired. ■



Note that the bound of Theorem 10 implies $A_\epsilon(I) \le 2(1 + \epsilon)\,\mathrm{Opt}(I)$, since $\mathrm{Opt}(I) \ge 1$ after scaling.

We also note that if one is willing to relax both the feasibility constraints and the cost of the solution
obtained, a polynomial-time $(1 + \epsilon)$ approximation scheme of sorts is possible. (Of course, this would not
guarantee k-anonymity.)

Theorem 11 Assume that all item sizes $x_i$ in the input are expressed in binary, and let $\epsilon > 0$ be a fixed
constant. There is a polynomial-time algorithm that, given some instance I of Min-Max Bin Covering, finds
a partition of the items into disjoint bins $S_j$ such that

$$\forall j:\ \sum_{i \in S_j} x_i \ge 1 - \epsilon, \quad \text{and} \quad \max_j \sum_{i \in S_j} x_i \le (1 + \epsilon)\,\mathrm{Opt}(I). \qquad (5)$$

Proof: [sketch] Roughly, one can use an algorithm similar to that of the previous theorem but omitting the 
last phase in which we merge bins to eliminate infeasibility. We omit the details. ■ 



5 Experimental Results 

In this section, we give experimental results for implementations and extensions of a couple of our
approximation algorithms. Because of space limitations, we focus here on approximate k-anonymization for
unordered data.

So as to represent a real distribution of quasi-identifying information, we have chosen to use the follow- 
ing data sets provided by the U.S. Census Bureau from the 1990 U.S. Census: 

• FEMALE-1990: Female first names and their frequencies, for names with frequency at least 0.001%. 

• MALE-1990: Male first names and their frequencies, for names with frequency at least 0.001%.

• LAST-1990: Surnames and their frequencies, for surnames with frequency at least 0.001%. 

For each data set, we ran a number of experiments, so as to test the quality of the approximations
produced by the method of Theorem 7, which we call Fold, and compare that with the quality of the
approximations produced by a simplified version of the method of Theorem 9, which we call Spread, for all values
of k ranging from the frequency of the most common name to the value of k that results in there being only
two equivalence classes.

The simplification we implemented for the method of Theorem 9 involves the distribution of left-over
items at the end of the algorithm. In this case, we distribute left-over items among existing equivalence
classes using a greedy algorithm, where we first add items to classes that have less than the current maximum
until adding an item to any class would increase the maximum. At that point, we then distribute the
remaining items to equivalence classes in a round-robin fashion.
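
One possible reading of this left-over distribution is sketched below. It is an interpretation, not the code used in the experiments: the function name `spread_leftovers` is hypothetical, left-over items are processed in the order given, and the greedy phase always tries the class with the lowest current size (if an item does not fit there without raising the maximum, it fits nowhere).

```python
def spread_leftovers(classes, leftovers):
    """Distribute left-over items over existing equivalence classes:
    greedily while the current maximum is not exceeded, then round-robin."""
    classes = [list(c) for c in classes]
    levels = [sum(c) for c in classes]
    cap = max(levels)  # current maximum class size
    rest = list(leftovers)

    # Greedy phase: fill the smallest class as long as the maximum is kept.
    while rest:
        j = min(range(len(classes)), key=levels.__getitem__)
        if levels[j] + rest[0] > cap:
            break
        x = rest.pop(0)
        classes[j].append(x)
        levels[j] += x

    # Round-robin phase for whatever remains.
    for t, x in enumerate(rest):
        classes[t % len(classes)].append(x)
    return classes


classes = [[5, 5], [6], [7]]
result = spread_leftovers(classes, [1, 2, 3])
print(result, max(sum(c) for c in result))
```

In this example all three left-over items are absorbed without raising the largest class above its original size of 10.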

We tested both approaches on each of the above data sets, with the data being either randomly ordered
or sorted by frequencies. For each test, we analyzed the ratio of the size of the largest equivalence class to
k, the anonymization parameter, since this ratio serves as an upper bound on the algorithm's approximation
factor. These overfull ratios are reported for each algorithm and each of the above data sets in Figure 7.

There are a number of interesting observations we can make from our experimental results, including the
following: 

• The Spread algorithm is superior to the Fold algorithm, for both randomly-ordered and sorted data. 

• Generalizing data into equivalence classes based on a random ordering of the frequencies is often 
superior to a sorted order. 

• When considering values of k in increasing order, there are certain threshold values of the parameter 
k where the number of equivalence classes drops by one, and this drop has a negative effect on the 
overfull ratio. The negative effect is especially pronounced for the Fold algorithm. (This behavior is 
what causes the increasing "jagginess" towards the right of the ratio plots.) 

• The performance of both the Fold and Spread algorithms on these real-world data sets is much better
than the worst-case analysis.

Thus, our experiments confirm our intuition that the Spread algorithm is better than the Fold
algorithm. In addition, our experimental analysis shows that the Spread algorithm performs quite well on
real-world data sets.

6 Future Directions 

There are a number of interesting directions for future work. For example, real world data sets often have 
two or three quasi-identifying attributes (such as zip-codes and disease name labels). Our results show that 
k-anonymization problems in such cases are NP-hard, but there are a host of open problems relating to how
well such multi-attribute problems can be solved approximately in polynomial time. 

Figure 7: Overfull ratios for the FEMALE-1990, MALE-1990, and LAST-1990 data sets, respectively.
Maximum ratios are reported for each subrange with respect to four algorithms: Random Fold, which is
the Fold algorithm on randomly-ordered data; Random Spread, which is the Spread algorithm on
randomly-ordered data; Sorted Fold, which is the Fold algorithm on ordered data; and Sorted Spread,
which is the Spread algorithm on ordered data.